optimize u128/i128 popcounts further #354
Conversation
Thanks! Do you think it would be worth trying to make it work even for non-native integers by using the operations from the […]

I'll give it a try - if it ends up working, I'll replace what I've posted here with whatever works. If not, I'll let you know what fails.
I think I hit a bug in libgccjit. When I compile with this patch, I get the following error:

[error output collapsed]
Indeed. Good find!

Hmm, I wonder if replacing the assignment ops with just plain assignments would work. I'll give it a try.

I added the type check in this commit. Yeah, that might work indeed.
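For context, here is a minimal standalone sketch of the workaround under discussion, assuming the `gccjit` crate's block API (`add_assignment_op` for compound assignments, `add_assignment` plus `new_binary_op` for the plain form). The function name `demo` and output file are made up for illustration:

```rust
use gccjit::{BinaryOp, Context, FunctionType, OutputKind, ToRValue};

fn main() {
    let context = Context::default();
    let int_ty = context.new_type::<i64>();
    let x = context.new_parameter(None, int_ty, "x");
    let func = context.new_function(None, FunctionType::Exported, int_ty, &[x], "demo", false);
    let block = func.new_block("entry");
    let acc = func.new_local(None, int_ty, "acc");
    block.add_assignment(None, acc, x.to_rvalue());

    // Compound assignment (`acc += x`) -- the form that tripped the
    // libgccjit bug on 128-bit operands:
    // block.add_assignment_op(None, acc, BinaryOp::Plus, x.to_rvalue());

    // Plain-assignment equivalent (`acc = acc + x`):
    let sum = context.new_binary_op(None, BinaryOp::Plus, int_ty, acc.to_rvalue(), x.to_rvalue());
    block.add_assignment(None, acc, sum);

    block.end_with_return(None, acc.to_rvalue());
    context.compile_to_file(OutputKind::ObjectFile, "demo.o");
}
```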
Sadly, while swapping the assignment ops out for plain assignments works around the bug, it also results in poor codegen when 128-bit integers aren't natively supported and the fallback is skipped.

Before:

```asm
popcount_128:
        xor     eax, eax
        popcnt  rax, rsi
        test    rsi, rsi
        cmove   rax, rsi
        xor     ecx, ecx
        popcnt  rcx, rdi
        mov     rdx, rax
        add     rax, rcx
        test    rdi, rdi
        cmove   rax, rdx
        ret
```

After:

```asm
popcount_128:
        push    r14
        mov     rax, rsi
        push    r13
        push    r12
        push    rbp
        push    rbx
        sub     rsp, 32
        or      rax, rdi
        mov     QWORD PTR 16[rsp], rdi
        mov     QWORD PTR 24[rsp], rsi
        vmovdqa xmm3, XMMWORD PTR 16[rsp]
        vmovdqa XMMWORD PTR [rsp], xmm3
        je      .L67
        mov     rbp, rsi
        mov     rbx, rdi
        xor     r12d, r12d
.L66:
        mov     r14, QWORD PTR [rsp]
        mov     r13, QWORD PTR 8[rsp]
        xor     ecx, ecx
        mov     edx, 1
        add     r12d, 1
        mov     rdi, r14
        mov     rsi, r13
        call    __rust_u128_sub@PLT
        xor     ecx, ecx
        mov     edx, 1
        mov     rdi, r14
        mov     rsi, r13
        and     rbx, rax
        mov     QWORD PTR 16[rsp], rax
        call    __rust_u128_sub@PLT
        vmovq   xmm0, QWORD PTR 16[rsp]
        mov     rax, rbx
        and     rbp, rdx
        vpinsrq xmm0, xmm0, rdx, 1
        vpand   xmm1, xmm0, XMMWORD PTR [rsp]
        or      rax, rbp
        vmovdqa XMMWORD PTR [rsp], xmm1
        jne     .L66
        mov     eax, r12d
.L63:
        add     rsp, 32
        pop     rbx
        pop     rbp
        pop     r12
        pop     r13
        pop     r14
        ret
.L67:
        xor     eax, eax
        jmp     .L63
```
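Reading the "After" listing, the generated loop appears to be the classic bit-clearing popcount: each iteration computes `x - 1` (through the `__rust_u128_sub` libcall, since 128-bit subtraction isn't native here) and ANDs it into `x`, clearing the lowest set bit until `x` reaches zero. A Rust sketch of the equivalent logic:

```rust
// Bit-clearing (Kernighan-style) popcount: one iteration per set bit.
fn popcount_bit_clearing(mut x: u128) -> u32 {
    let mut count = 0;
    while x != 0 {
        x &= x - 1; // clear the lowest set bit
        count += 1;
    }
    count
}

fn main() {
    assert_eq!(popcount_bit_clearing(0b1011), 3);
    assert_eq!(popcount_bit_clearing(u128::MAX), 128);
}
```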
However, when 128-bit integers are supported, we do see the improvement as expected.

Before:

```asm
popcount_128:
        xor     eax, eax
        popcnt  rax, rsi
        test    rsi, rsi
        cmove   rax, rsi
        xor     ecx, ecx
        popcnt  rcx, rdi
        mov     rdx, rax
        add     rax, rcx
        test    rdi, rdi
        cmove   rax, rdx
        ret
```

After:

```asm
popcount_128:
        mov     rdx, rdi
        or      rdx, rsi
        je      .L65
        xor     ecx, ecx
        popcnt  rsi, rsi
        popcnt  rcx, rdi
        lea     eax, [rsi+rcx]
        ret
.L65:
        xor     eax, eax
        ret
```

I think if we gate the fallback behind native support for 128-bit integers, we'll get the best of both worlds: good-ish codegen when 128-bit integers are supported, and a roughly-as-good fallback when they aren't.
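To make the proposed gating concrete, here is a rough sketch. The shape is illustrative only: `native_u128` stands in for the codegen's actual "does the target support 128-bit integers natively?" check, which is not spelled this way in the code.

```rust
// Hypothetical sketch of the gating described above.
fn popcount_u128(value: u128, native_u128: bool) -> u32 {
    if native_u128 {
        // Native path: let the backend emit popcount on the full value.
        value.count_ones()
    } else {
        // Fallback: break the value apart, popcount each 64-bit half,
        // and add the results (the `popcnt rsi`/`popcnt rdi` pair above).
        let lo = value as u64;
        let hi = (value >> 64) as u64;
        lo.count_ones() + hi.count_ones()
    }
}

fn main() {
    assert_eq!(popcount_u128(u128::MAX, true), 128);
    assert_eq!(popcount_u128(u128::MAX, false), 128);
}
```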
Force-pushed from 3d97550 to 77c2ccc.
Do you know (or did you do some benchmarks to check) if having one jump is indeed faster than the two conditional moves?

I wrote some microbenchmarks with […]
Alright, I think I have some solid numbers (at least for my machine). My benchmarks run popcount 1000 times on the same number and add the results together. This is done to inflate iteration times a bit; with an operation as small as this, it's hard to get good numbers for how long it takes to complete.

Before:

[benchmark output collapsed]

After:

[benchmark output collapsed]
Source:

```rust
#![feature(test)]
extern crate test; // required to use the unstable `test` crate for benchmarks

pub fn popcount_128(a: u128) -> u32 {
    a.count_ones()
}

#[cfg(test)]
pub mod tests {
    use super::*;
    use test::{Bencher, black_box};

    #[bench]
    fn bench_popcount_u128_0(b: &mut Bencher) {
        let x = black_box(0u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_1(b: &mut Bencher) {
        let x = black_box(1u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_10(b: &mut Bencher) {
        let x = black_box(10u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }

    #[bench]
    fn bench_popcount_u128_minus_1(b: &mut Bencher) {
        let x = black_box(-1i128 as u128);
        b.iter(|| (0..1000).fold(0, |old, _new| old + popcount_128(black_box(x))))
    }
}
```

Not sure if other processor types will see similar results - I'm on an R7 5800X, but you might see different results on Intel processors or older AMD processors. I think a lot of the branch/conditional moves will go away once #351 gets resolved, so this might not even matter in the long term.
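(For reference, benchmarks written against the unstable `test` crate like these require a nightly toolchain and are run with `cargo bench`.)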
Don't fall back on breaking apart the popcount operation if 128-bit integers are natively supported.

Signed-off-by: Andy Sadler <[email protected]>
Force-pushed from 77c2ccc to 81c1f39.
Thanks a lot for your work!